R is a versatile coding language for data science, with a wonderful community supporting it. Here’s a short list of some of the things that make R great.
Free and open source: It’s a free and open source programming language and environment for statistical computing, machine learning, and graphics.
Reproducibility and reporting: Writing reproducible reports is now easier than ever thanks to packages like knitr and R Markdown.
RStudio: RStudio is a powerful integrated development environment (IDE) that has made learning and using R much easier, with options for workflow and project management.
Graphics: R can be used to make great data graphics, with packages like ggplot2 helping users make graphics in an intuitive way.
R packages and community: With over 15,000 packages on CRAN alone, there’s pretty much a package to do anything. The greater R community has also expanded tremendously over time, bringing in new users and pushing R to be useful in more applications. Each year there are thousands of meetups, conferences, seminars, and workshops on R all around the world.
Familiarise yourself with RStudio and R Notebooks, which is what we’ll use to interact with R.
Learn about the simple data structures in R: object, vector, and data frame.
Explore R’s basic data types: integer, character, numeric, etc.
Learn to read data into R.
Introduction to data wrangling using the tidyverse, a metapackage that bundles a set of related packages.
Use the tidyverse verbs to explore the gapminder data set which includes statistics for countries around the world including life expectancy, population, and GDP per capita.
Learn to merge datasets using left_join.
Create meaningful visualisations of the data using ggplot2.
Learn where to go for help.
First let’s set it so that our notebook shows up in our viewer.
In this training we will be using RStudio. RStudio is an integrated development environment (IDE) for R and is broken down into various panels for our convenience.
Source: R-Ladies Sydney Basic Basics.
Some people like RStudio to remember stuff from session to session. However, this can be dangerous, as previous work and packages can interfere with current code and make your code more fragile. To avoid this, it is recommended that you change two settings in RStudio.
Source: R-Ladies Sydney Basic Basics.
Locate Preferences (On Windows, this is in the Tools->Global Options menu; on a Mac, this is in the RStudio menu). In the General tab, uncheck “Restore .RData…” and select “never” for “Save workspace…”
R Notebooks give the opportunity to combine code and description in a single human-readable notebook. You can conduct analysis and give interpretation side-by-side! This means that your entire analytical approach can be documented together, from the raw data to the analysis and finally results and conclusions.
We will be entering the R code into these blocks:
print('code goes here!')
## [1] "code goes here!"
We can run the block of code using the play button on the right. We can also run this block of code with all previous blocks of code with the downwards facing play button in the middle.
In some places I have added additional arguments to the code chunk (e.g. eval = FALSE) so that something is not evaluated in order for the html file to compile. See the example below:
print('code goes here!')
Feel free to change this by simply removing the , eval = FALSE especially as you update the document. However, note that if there are any code errors left, the html file will not compile.
You can add comments within your R code chunk using #.
You can comment or uncomment code using Ctrl + Shift + C.
You can run a line of code by placing your cursor anywhere on the line and using Ctrl + Enter. This will execute the line of code and move the cursor to the next line.
Let’s start by making an assignment and inspecting the object we created.
x <- 10*5
x
## [1] 50
All R statements where you create objects by making an ‘assignment’, take the form:
object_name <- value
You can think of objects as storage containers for values. An object is created using the operator <-. It can be a pain to type <-, but don’t be tempted to use = as this has another specific use in the R language.
You can name your objects almost anything, using letters, numbers, periods and underscores. A name must start with a letter, or with a dot that is not followed by a number; it cannot start with a number (1, 2, 3...) or an underscore, and it cannot contain other characters such as a comma or a space.
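As a quick way to check a candidate name, base R's make.names() converts any string into a syntactically valid name, so a string that comes back unchanged was already valid. The helper is_valid_name() below is a hypothetical convenience written for this sketch; it is not part of R.

```r
# Hypothetical helper: a string is a valid object name if make.names()
# does not need to alter it.
is_valid_name <- function(x) {
  x == make.names(x)
}

is_valid_name("this_works")   # letters and underscores are fine
is_valid_name("2nd_attempt")  # starts with a number
is_valid_name("my name")      # contains a space
```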
this_works <- 10*5
this_works
## [1] 50
Try running the following lines of code. Try uncommenting the code # this_doesn't_work <- 10*5 by clicking on the line and using Ctrl + Shift + C.
# this_doesn't_work <- 10*5
It is useful for future you and your collaborators to name your objects something that is reasonable and describes what the object contains. To make your object names easy to read it is useful to adopt a convention for demarcating words in names.
jenny_bryan_and_hadley_wickham_use_snake_case
some.people.use.periods
othersUseCamelCase
Make a new object
a_very_long_name <- 7^2
Sometimes, to make our object names readable, we use long names that can be laborious to type. Luckily, RStudio has a handy completion facility.
Start by typing the first few letters of a_very... in the code chunk below and type TAB to complete the name.
a_very_long_name
## [1] 49
Let’s try inspecting the object again.
# What happens if you run:
a_vry_long_name
A_very_long_name
R is very sensitive to both case and spelling mistakes and won’t run unless things are spelled correctly and are in the right case. If you get an error, check your spelling! Very often, this is the cause of your error.
A vector is a 1-dimensional ordered collection of elements, all of the same type. It is the fundamental data structure in R with a lot of useful properties.
We can extract an element from a vector by referencing its position. Let’s make a new vector called character_vector using the function c() which can be used to c()ombine elements.
Using c():
## Defining the character vector:
character_vector <- c("ET", "Phone", "Home", "ET", "Phone", "Home")
Notice that when we specify words or characters, we use "".
Using str():
str(character_vector)
## chr [1:6] "ET" "Phone" "Home" "ET" "Phone" "Home"
Thanks to the "" around our text, R is able to recognise that the vector contains character strings (chr).
Using length():
length(character_vector)
## [1] 6
Using : to select a range:
character_vector[3:5]
## [1] "Home" "ET" "Phone"
Using <-: try replacing the 4th element with your name:
character_vector[4] <- "Laurie"
character_vector
## [1] "ET" "Phone" "Home" "Laurie" "Phone" "Home"
The same method used to extract information works for any type of vector. Here we can define a new vector numeric_vector containing the numbers 1, 2, 3, 4, and 5.
numeric_vector <- c(1:5) # c() is a function to combine elements
Using str():
str(numeric_vector)
## int [1:5] 1 2 3 4 5
Because we have specified whole numbers, R can classify the vector either as an integer (int) or as numeric (num). Here it is an integer vector, because the : operator produces integers.
numeric_vector[1:2]
## [1] 1 2
Using c(): try uncommenting and running the line below:
# numeric_vector[1,3]
Note that we can only select non-consecutive elements, such as the 1st and 3rd, or the 1st, 3rd, and 4th, by combining their positions with c().
numeric_vector[c(1,3:4)]
## [1] 1 3 4
Let’s make a second numeric vector.
numeric_vector2 <- c(1.1,3:4)
## Check the structure
str(numeric_vector2)
## num [1:3] 1.1 3 4
You’ll notice that now when we check the structure, the vector is numeric (num). This is because we now have a number with a decimal place.
R is what is known in computer science as a dynamically typed language. R doesn’t require you to set the data type when you create a vector; instead, it figures out what the best data type is for the object you are creating: numeric, character, factor, logical, etc.
However, sometimes the data type you want to work with, and the one R infers are not the same. You can change the data type using a range of in-built functions that enable you to convert data from one type to another.
as. functions
A useful set of functions are the as. functions, which take the form as.<structure>. We can use this to specify the structure of our numeric vector to be numeric.
numeric_vector <- as.numeric(numeric_vector)
str(numeric_vector)
## num [1:5] 1 2 3 4 5
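As a further sketch of the as. functions (the vector numbers_as_text is made up for this illustration), numbers that were read in as text can be converted so that they can be used in arithmetic. Values that cannot be converted become NA.

```r
# Numbers stored as text, e.g. after reading a messy file
numbers_as_text <- c("1", "2", "3")
str(numbers_as_text)              # chr: stored as text

converted <- as.numeric(numbers_as_text)
str(converted)                    # num: now usable in arithmetic
sum(converted)

# Values that cannot be converted become NA (R would also give a warning,
# which is silenced here to keep the output tidy)
suppressWarnings(as.numeric(c("4", "five")))
```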
The structure of a vector becomes important when we use it in an analysis, because functions treat different types differently.
character_vector <- as.factor(character_vector)
str(character_vector)
## Factor w/ 4 levels "ET","Home","Laurie",..: 1 4 2 3 4 2
Note that character_vector is now classed as a factor (Factor) with 4 levels: “ET”, “Home”, “Laurie”, and “Phone”.
When you create a factor, R uses an integer code to represent each level, so “ET” is both “ET” and 1, and “Home” is both “Home” and 2. You’ll notice that it automatically uses alphabetical order when determining the factor levels. This means that even though “Phone” occurs 2nd in our character vector, it gets the integer code 4. This is just a detail now, but it becomes important in plotting, especially if you want to change the order in which your factors are plotted.
Factors are especially useful if we want to group data by a factor (e.g. country) for counting or summarising. For instance, “Home” and “Phone” each occur twice, whereas “Laurie” and “ET” each only occur once.
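The levels, integer codes, and counts described above can be inspected directly. This is a minimal sketch that rebuilds the example vector under a new name, words, so that it does not overwrite anything in your workspace.

```r
# Recreate the example vector as a factor
words <- factor(c("ET", "Phone", "Home", "Laurie", "Phone", "Home"))

levels(words)      # the levels, in alphabetical order
as.integer(words)  # the integer code behind each element
table(words)       # how many times each level occurs
```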
Vectors aren’t just containers for homogeneous data. R is a vectorised language, which means operations are applied to each element of a vector automatically, without the need to loop through the vector.
This is powerful because, at a low level, computer chips are generally optimised for these types of calculations (SIMD: single instruction, multiple data).
Let’s look at some examples
numeric_vector
## [1] 1 2 3 4 5
numeric_vector*3
## [1] 3 6 9 12 15
numeric_vector^2
## [1] 1 4 9 16 25
You can also multiply, divide, add, and subtract vectors of the same length.
x <- seq(from = 1, to = 20, by = 4)
x
## [1] 1 5 9 13 17
numeric_vector/x
## [1] 1.0000000 0.4000000 0.3333333 0.3076923 0.2941176
What happens when you run the following line of code?
x - numeric_vector
## [1] 0 3 6 9 12
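To make the idea of vectorisation concrete, here is a small sketch comparing a vectorised multiplication with the equivalent explicit loop. The object names (values, tripled_vectorised, tripled_loop) are made up for this example.

```r
values <- c(1, 2, 3, 4, 5)

# Vectorised: the multiplication is applied to every element at once
tripled_vectorised <- values * 3

# The same result written as an explicit loop
tripled_loop <- numeric(length(values))
for (i in seq_along(values)) {
  tripled_loop[i] <- values[i] * 3
}

identical(tripled_vectorised, tripled_loop)
```

Both approaches give the same answer; the vectorised form is simply shorter and easier to read.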
Fill in the code chunks to answer the following questions
numeric_vector[]
Hint: you can use length() to find out how many elements there are in the character vector.
character_vector[]
numeric_vector
y <- c(5:1)
w <- c(1:4)
numeric_vector/w
So far, we’ve seen R’s capabilities as a large calculator and also as a place for storing objects and vectors. However, it is much much more than that! One of the things that makes R amazing is the open source community surrounding it.
The R community, which is made up of academics, statisticians, social and political scientists, economists, and data scientists, to name a few, is responsible for authoring a wide variety of packages (>15,000) that can do a wide range of data manipulation, visualisation, and analysis tasks.
To get your head around what CRAN, library, packages, and functions are I find it helpful to think of books.
CRAN stands for the Comprehensive R Archive Network. It’s like the R equivalent of the British Library or Library of Congress. It holds a copy of every package (book) published on CRAN and all the versions of R.
On your computer you’ll have a local library with copies of the packages you’ve installed from CRAN (your home office book shelf).
Click on the ‘Packages’ tab in the lower right hand panel (Q4 from before). You can see what packages are in your library, a short description of what they do, and the package version.
The packages that are loaded have a check mark in the box on the left. As before, there are several packages that are automatically loaded each time you start an R session, e.g. base package.
Although it is possible to load and install your packages from here, I recommend using the functions shown below instead. This way, someone else or future you knows exactly what packages they need to run the analyses.
You should load the packages you will use at the top of your script, so that future you or your colleague knows what needs to be installed/loaded.
A package needs to be installed only once and requires an internet connection which allows your computer to communicate with the CRAN server.
You may wish to install a package with the additional argument dependencies = TRUE. This will also install any packages that the package depends on.
On your personal computer, you can install a package to your local library from CRAN by uncommenting and running the following:
# install.packages("tidyverse")
# install.packages("ggplot2", dependencies = TRUE)
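Because a package only needs to be installed once, one common pattern is to check whether it is already available before installing. The helper install_if_missing() below is a hypothetical function written for this sketch; it uses base R's requireNamespace() to do the check.

```r
# Hypothetical helper: install a package only if it is not already available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is now available
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so nothing is downloaded here
install_if_missing("stats")

# Uncomment to use it for a CRAN package (needs an internet connection)
# install_if_missing("tidyverse")
```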
However, if you are on a government laptop without elevated access rights, read further…
On your government laptop, you will need to put in a Service Desk Software Request for any packages you want installed.
As a standard user, you are unable to run R packages that you download yourself. R installs them to your Documents folder and, because of restrictions on the government laptops, it is then unable to run the package from this location because the DLL files it contains are blocked.
As a result, the R installation often comes with many of the packages you’ll need pre-installed. For any other packages you wish to install, you can put in a Service Desk request.
If you do install packages yourself from the internet, it is highly likely that you will get this error when you try to load them.
“Error: package or namespace load failed for ‘ggplot2’ in inDL(x, as.logical(local), as.logical(now), …): unable to load shared object ‘C:/Users/l-baker/Documents/R/win-library/3.6/rlang/libs/x64/rlang.dll’: LoadLibrary failure: This program is blocked by group policy. For more information, contact your system administrator.”
If you try to install packages and get the above error, you can fix it by deleting your R folder from Documents. R will then return to looking for packages that come supplied with your department’s R distribution.
In order to use the package you need to load it to your workspace. This needs to be done each time you start a new RStudio session or project. Think of it as taking the book you will use off your book shelf to place next to you on the desk.
In this case, tidyverse is a meta-package, which actually contains several individual packages including dplyr, forcats, etc., but more on those later. The tidyverse metapackage is in your library already so we can simply make a call to load it.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.2
## -- Attaching packages ---------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.6.2
## -- Conflicts ------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Alas, there are not enough names to make each function in every package unique. The “Conflicts” line that is printed tells us that the dplyr function filter will mask the stats package function filter.
If we want to be completely unambiguous, we can specify the package and function using the following form <package_name>::<function_name>, e.g. dplyr::filter().
If we follow the recipe book analogy, this is like saying we want the lasagna recipe from jamie_oliver::lasagna so that it isn’t confused with the nigella_lawson::lasagna recipe.
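A small sketch of the two filter() functions called explicitly by package is shown here. The toy data frame is made up for illustration, and the dplyr part only runs if dplyr is installed.

```r
# stats::filter() applies a linear filter (here a 3-point moving average)
moving_avg <- stats::filter(1:5, rep(1/3, 3))
moving_avg

# dplyr::filter() keeps the rows that match a condition
if (requireNamespace("dplyr", quietly = TRUE)) {
  toy <- data.frame(x = 1:5)   # toy data frame, made up for this sketch
  dplyr::filter(toy, x > 3)
}
```

The two functions share a name but do completely different jobs, which is why the package prefix is useful.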
You can think of a package like a book on a particular subject. Each package is designed to do a specific set of tasks (e.g. data manipulation, implement linear models, draw geographical maps, etc.). Each task is implemented using a function, which is a set of statements organised to complete the task.
A function is like a recipe from a book. It is designed to make one specific thing, e.g. cupcakes or steak and kidney pie. The function takes arguments (e.g. ingredients) and then carries out a series of steps where the ingredients are modified, cooked, combined, etc. to create the final recipe.
Some of these arguments will be optional (e.g. add or don’t add cinnamon), whereas other arguments will be required for the function to run (e.g. you can’t make the cake without flour!).
Functions follow the form:
functionName(argument1 = value1, argument2 = value2, and so on)
Let’s take a look at some of the built-in functions R has for carrying out basic statistics/analysis, starting with seq().
seq() function
Let’s try using seq(), which makes regular sequences of numbers, and, while we’re at it, demo more helpful features of RStudio.
Type se and hit TAB. A pop-up shows you possible completions.
se
Select seq() by typing more to narrow down the function, or by using the up/down arrows. Notice the floating help box that pops up to remind you of the function’s arguments.
If you want even more help, press F1 as directed to get the full documentation in the help tab of the lower right pane. You can also access the help file for a function by typing ?seq.
Now open the parentheses and notice the automatic addition of the closing parenthesis and the placement of cursor in the middle. Type the arguments 1, 10 and hit return. RStudio also exits the parenthetical expression for you.
seq(1,10)
## [1] 1 2 3 4 5 6 7 8 9 10
Let’s take a closer look at the help file for seq().
?seq
Every help file will have a series of sections describing what the function does. I generally focus first on: Description, Usage, Arguments, and Examples.
For example, in the helpfile for seq() under Description, it tells us it is a function to “Generate regular sequences”.
We can see that seq() takes the arguments from, to, and by, and the optional arguments length.out and along.with.
Here, we can find out what these arguments are:
from, to: the starting and (maximal) end values of the sequence.
by: number: increment of the sequence.
In the code above, we generated a sequence of numbers from 1 to 10. We did not supply a value for by, so it took its default value, which in this case is 1.
What happens if we try:
seq(10,1)
And what about:
seq(to = 10, from = 1)
The above demonstrates something about how R resolves function arguments. You can always specify in name = value form. But if you do not, R attempts to resolve by position.
So above, first it is assumed that we want a sequence from = 1 that goes to = 10. Then if we swap the numbers it is assumed we want to sequence from = 10 that goes to = 1. If we name the arguments explicitly using name = value, the order of the arguments doesn’t matter.
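The matching rules described above can be checked directly. This is a minimal sketch; the object names by_position, reversed, and by_name are made up for this example.

```r
by_position <- seq(1, 10)             # matched by position: from = 1, to = 10
reversed    <- seq(10, 1)             # positions swapped: from = 10, to = 1
by_name     <- seq(to = 10, from = 1) # named, so the order does not matter

identical(by_position, by_name)       # same sequence either way
identical(reversed, rev(by_position)) # swapping positions reverses it
```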
If we want to store our output in an object and see it in the same line, we can use:
(y <- seq(from = 1, to = 10))
Let’s take a look at our workspace and showcase a function that doesn’t require any arguments.
ls()
If you want to remove the vector name y you can use
rm(y)
If you want to remove everything in your workspace you can use:
rm(list = ls())
You may want to do this at the end of an analysis before you start on another project.
Whenever your data is rectangular and spreadsheet-like, the default data format in R is a data frame. Data frames can hold variables of different types: each column is essentially a vector, and different columns can hold numeric data (GDP), character data (country name), or categorical information (infected vs. uninfected).
Data frames are extremely useful, and many functions are set up to take a data frame for the data = argument. The tidyverse packages, which include dplyr and ggplot2, work with a special type of data frame called a “tibble”.
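To see how columns of different types sit side by side, here is a minimal sketch that builds a small data frame by hand using base R. The values and the name toy are made up for illustration.

```r
# A toy data frame (values made up for illustration)
toy <- data.frame(
  country  = c("A", "B", "C"),           # character
  gdp      = c(1200.5, 830.2, 15400.0),  # numeric
  infected = c(TRUE, FALSE, TRUE),       # logical
  stringsAsFactors = FALSE
)

str(toy)            # one line per column, showing each column's type
toy$gdp             # a single column is just a vector
is.vector(toy$gdp)
```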
Our data comes from the Gapminder foundation, an organization dedicated to educating the public by using data to dispel common myths about the so-called developing world. The dataset we will use is one that has been combined from the gapminder data set from the gapminder package, and the gapminder data set from the dslabs package.
Before you read in a data file you want to ask yourself two questions:
In this case, we are going to read in a .csv (comma separated value) file called gapminder.csv.
The tidyverse comes with a number of useful functions for reading in data. For some of the most common files you work with you can use:
read_csv: reads in a csv file.
read_excel: from the readxl package, reads in an Excel file (.xls and .xlsx). Possible to add the sheet number or name you wish to extract. Check out the arguments in the helpfile using ?read_excel.
and much, much more! If you are looking for another file type I highly recommend checking out this section from Jenny Bryan’s UBC stats course Stat545 Import and Export or looking more generally into the readr package. There are nice options for removing lines of meta data (e.g. rambles at the head of an excel spreadsheet) and other options for messier data frames.
read_ functions
The functions for reading in the data take the same basic form:
my_file <- file.path("data", "gapminder.csv")
gapminder <- read_csv(file = my_file)
First you need to specify the name of the data frame you want to store your data in.
Then you specify the file name (don’t forget the file format e.g. .csv) and the location where it is stored in quotes.
In this case the file is stored in the folder “data”, which is part of the IntroR course master folder you were sent. Here, you’ll notice that we are using a relative path, that is, the location of the data is specified in relation to our script file. Relative paths are especially useful because they will work across all operating systems and, unlike a “hard path”, e.g. "C:/Users/l-baker/repos/The_faculty/IntroR4IntlDev", this relative path will work on anyone’s computer, not just my own!
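As a self-contained sketch of building a path and reading from it, the example below writes a tiny made-up csv file to a temporary folder and reads it back. It uses base R's read.csv() as a stand-in for read_csv() so that it runs without the course files or the tidyverse; the names rel_path, demo_dir, demo_file, and toy are made up for this example.

```r
# file.path() joins folder and file names with "/"
rel_path <- file.path("data", "gapminder.csv")
rel_path

# Use a throwaway folder so the sketch does not depend on the course files
demo_dir  <- file.path(tempdir(), "data")
dir.create(demo_dir, showWarnings = FALSE)
demo_file <- file.path(demo_dir, "toy.csv")

# Write a tiny csv file (toy values, made up), then read it back in
write.csv(data.frame(country = c("A", "B"), pop = c(10, 20)),
          demo_file, row.names = FALSE)

toy <- read.csv(demo_file)   # base R stand-in for readr::read_csv()
toy
```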
For a ‘relative path’ to work, we need to get to the right directory (location where our script file is stored). You can do this using the RStudio menu:
“Session -> Set Working Directory -> To Source File Location”. In this case this will set the working directory to the location where the script file: “IntroR4IntlDev.Rmd” is stored.
Alternatively, you can use setwd("C:/Users/l-baker/repos/The_faculty/IntroR4IntlDev") and give it the file path where your script file is located.
To find out where you are you can use the function getwd() which stands for “get working directory”.
getwd()
One of the good practices of coding is to never use absolute or “hard paths”. Telling a colleague which subfolder the data is kept in on your computer does not help them reproduce the code, as a hard path only works on your computer.
The advantage of “relative paths” is that they will work across operating systems and across anyone’s computer. For each project, it is best practice to set up a folder for that project with your script file and subfolders to store the “figures” and the “data”.
In sharing code, you share the whole master folder complete with the figure and data subfolders. Then as long as they set the working directory to the location of your script file, they can run your script with little trouble accessing the figures and data needed from the relative paths specified.
my_file <- file.path("data", "gapminder.csv")
gapminder <- read_csv(file = my_file)
## Parsed with column specification:
## cols(
## country = col_character(),
## continent = col_character(),
## year = col_double(),
## lifeExp = col_double(),
## pop = col_double(),
## gdpPercap = col_double(),
## infant_mortality = col_double(),
## fertility = col_double()
## )
The data contains 8 columns
For each of the three pairs of countries below, which country do you think had the highest infant mortality rates in 2007? Which pairs do you think are the most similar?
Sri Lanka or Turkey
Poland or Malaysia
Pakistan or Vietnam
For each of the two pairs of countries below, which country do you think had the highest life expectancy in 2007? Which pairs are the most similar?
South Africa or Yemen
Chile or Hungary
For the two pairs of countries below, which country do you think had the highest gdpPercap in 2007?
Switzerland or Kuwait
Colombia or Nepal
There are several tools to get to know our data.
View(): allows us to view the data frame as a spreadsheet.
nrow(): tells us the number of rows in our data frame.
names(): gives us the names of the columns in our data frame.
dim(): tells us the dimensions of our data frame.
summary(): gives us summary statistics (counts, min, median, mean, max).
head(): gives us the first 6 rows of the data.
tail(): gives us the last 6 rows of the data.
str(): tells us the variable type (e.g. Factor, num (number), int (integer)).
unique(): tells us the unique elements of a variable.
Let’s take a look at head and View to inspect the data more closely.
head(gapminder)
## # A tibble: 6 x 8
## country continent year lifeExp pop gdpPercap infant_mortality
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan~ Asia 1952 28.8 8.43e6 779. NA
## 2 Afghan~ Asia 1957 30.3 9.24e6 821. NA
## 3 Afghan~ Asia 1962 32.0 1.03e7 853. NA
## 4 Afghan~ Asia 1967 34.0 1.15e7 836. NA
## 5 Afghan~ Asia 1972 36.1 1.31e7 740. NA
## 6 Afghan~ Asia 1977 38.4 1.49e7 786. NA
## # ... with 1 more variable: fertility <dbl>
View(gapminder)
From viewing the data we can see that the data contains eight variables: country, continent, year, lifeExp, pop, gdpPercap, infant_mortality, and fertility.
Click “Filter” in the View menu. You can use this similarly to how you would interact with the data in Excel.
Exercise
Using filter, what was the life expectancy in Rwanda in 1952?
Which country had the highest infant mortality rate? What was the year?
We’ve already used str() to explore our vectors. We can also use it on our data frame to tell us what type of variables we have.
str(gapminder)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 8 variables:
## $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ continent : chr "Asia" "Asia" "Asia" "Asia" ...
## $ year : num 1952 1957 1962 1967 1972 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ gdpPercap : num 779 821 853 836 740 ...
## $ infant_mortality: num NA NA NA NA NA NA NA NA NA NA ...
## $ fertility : num NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. country = col_character(),
## .. continent = col_character(),
## .. year = col_double(),
## .. lifeExp = col_double(),
## .. pop = col_double(),
## .. gdpPercap = col_double(),
## .. infant_mortality = col_double(),
## .. fertility = col_double()
## .. )
In this case country and continent are characters, year, lifeExp, pop, gdpPercap, infant_mortality and fertility are numbers.
You’ll notice from the preview that both infant_mortality and fertility have some NAs. NAs are commonly used to show that there is no data for a given year and variable.
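Missing values affect calculations, so it helps to know how to count them and how to tell a function to ignore them. This is a minimal sketch using a short made-up vector; the name infant_mortality_toy and its values are for illustration only and are not taken from the gapminder data.

```r
# A toy vector with missing values (made up for illustration)
infant_mortality_toy <- c(NA, NA, 120.5, 98.3)

is.na(infant_mortality_toy)                # TRUE where a value is missing
sum(is.na(infant_mortality_toy))           # number of missing values
mean(infant_mortality_toy)                 # NA: missing values propagate
mean(infant_mortality_toy, na.rm = TRUE)   # ignore NAs when averaging
```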
One of the first things we are going to do is change the columns country and continent to factors, as we can treat them as categorical variables (i.e. they indicate a category that data belongs to). We can do this using the as.factor function.
gapminder$country <- as.factor(gapminder$country)
gapminder$continent <- as.factor(gapminder$continent)
str(gapminder)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 8 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : num 1952 1957 1962 1967 1972 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ gdpPercap : num 779 821 853 836 740 ...
## $ infant_mortality: num NA NA NA NA NA NA NA NA NA NA ...
## $ fertility : num NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. country = col_character(),
## .. continent = col_character(),
## .. year = col_double(),
## .. lifeExp = col_double(),
## .. pop = col_double(),
## .. gdpPercap = col_double(),
## .. infant_mortality = col_double(),
## .. fertility = col_double()
## .. )
You’ll notice from above that we can select columns by using the dollar sign $.
Run the following lines of code to answer the questions below
dim(gapminder)
names(gapminder)
Given spelling is so important in R, names() is a handy way to check the names of our columns.
head(gapminder)
tail(gapminder)
summary(gapminder)
unique(gapminder$year)
unique(gapminder$country)
Battleship
Whenever I think of R data frames I think of the game battleship. In battleship, to strike your opponent’s ships you launch missiles by giving a row and column reference for the location to hit on your opponent’s board.
Data frames are much the same. We can extract an element by specifying the rows and the columns:
data_frame[rows, columns]
If we wanted to get the first value from the first row and column in the dataframe we could use:
gapminder[1,1]
## # A tibble: 1 x 1
## country
## <fct>
## 1 Afghanistan
If we wanted the whole first row we could use:
gapminder[1,]
## # A tibble: 1 x 8
## country continent year lifeExp pop gdpPercap infant_mortality
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan~ Asia 1952 28.8 8.43e6 779. NA
## # ... with 1 more variable: fertility <dbl>
Notice, that if we want to select all columns we simply add a comma and leave the column position blank.
What happens if you run the following?
gapminder[1]
If we wanted the first 5 rows and the first and third columns we could use:
gapminder[1:5, c(1,3)]
## # A tibble: 5 x 2
## country year
## <fct> <dbl>
## 1 Afghanistan 1952
## 2 Afghanistan 1957
## 3 Afghanistan 1962
## 4 Afghanistan 1967
## 5 Afghanistan 1972
Remember from before that if we have nonconsecutive positions, we need to use the c() function to combine these positions into a vector.
We can also reference the column by name:
gapminder[1:5, c("country", "gdpPercap")]
## # A tibble: 5 x 2
## country gdpPercap
## <fct> <dbl>
## 1 Afghanistan 779.
## 2 Afghanistan 821.
## 3 Afghanistan 853.
## 4 Afghanistan 836.
## 5 Afghanistan 740.
Why might this be preferred to referencing columns by number?
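One way to explore this question is to shuffle the columns of a small data frame and compare the two approaches. The data frame toy and its values are made up for this sketch.

```r
toy <- data.frame(country = c("A", "B"), year = c(1952, 1957), pop = c(10, 20))
reordered <- toy[, c("pop", "country", "year")]  # same data, columns shuffled

# By position, column 3 means different things in the two data frames
names(toy)[3]
names(reordered)[3]

# By name, we get the same column regardless of its position
identical(toy[, "pop"], reordered[, "pop"])
```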
pop and save it in a new object called pop.
Hint: look back to how we selected row 1 and all columns.
pop <- gapminder[]
gapminder[]
Bonus
gdpPercap and pop.
Hint: look back to how we selected row 1 and all columns.
gapminder[]
lifeExp and save it in a new data frame called sub_lifeExp.
sub_lifeExp <- gapminder[]
dplyr:
So far I’ve shown you the ‘old school’ method for extracting and filtering data. It is useful to know the layout of vectors and dataframes, especially if you end up writing your own for loops or functions in the future.
However, the package, dplyr, has made a lot of data manipulation easier and clearer using verbs to filter and select different elements.
select() subsets columns based on their names.
filter() subsets rows based on their values.
summarise() calculates summary statistics.
group_by() groups variable for summarising.
mutate() adds new columns that are functions of existing variables.
These verbs can be combined in powerful ways to do some really interesting data manipulation tasks.
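As a rough sketch of how these verbs fit together, the example below chains several of them on a tiny made-up data frame. The names toy and summary_2007, and all the values, are for illustration only. It uses the pipe operator %>%, which passes the result of one step into the next, and only runs if dplyr is installed.

```r
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)

  # Toy data (values made up for illustration)
  toy <- data.frame(
    continent = c("X", "X", "Y", "Y"),
    year      = c(2002, 2007, 2002, 2007),
    pop       = c(10, 12, 100, 110)
  )

  summary_2007 <- toy %>%
    filter(year == 2007) %>%                 # keep rows for 2007
    mutate(pop_thousands = pop / 1000) %>%   # add a derived column
    group_by(continent) %>%                  # one group per continent
    summarise(total_pop = sum(pop))          # one summary row per group

  print(summary_2007)
}
```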
select(gapminder, lifeExp, pop)
## # A tibble: 1,704 x 2
## lifeExp pop
## <dbl> <dbl>
## 1 28.8 8425333
## 2 30.3 9240934
## 3 32.0 10267083
## 4 34.0 11537966
## 5 36.1 13079460
## 6 38.4 14880372
## 7 39.9 12881816
## 8 40.8 13867957
## 9 41.7 16317921
## 10 41.8 22227415
## # ... with 1,694 more rows
These verbs can be used by specifying the data frame first, or using the pipe operator %>%. You can think of the pipe operator as meaning “and then”.
gapminder %>%
select(lifeExp, country)
## # A tibble: 1,704 x 2
## lifeExp country
## <dbl> <fct>
## 1 28.8 Afghanistan
## 2 30.3 Afghanistan
## 3 32.0 Afghanistan
## 4 34.0 Afghanistan
## 5 36.1 Afghanistan
## 6 38.4 Afghanistan
## 7 39.9 Afghanistan
## 8 40.8 Afghanistan
## 9 41.7 Afghanistan
## 10 41.8 Afghanistan
## # ... with 1,694 more rows
One big advantage of working this way is that these verbs return a new data frame and do not change your raw data in any way, unless you assign the result back to the same name!
head(gapminder)
## # A tibble: 6 x 8
## country continent year lifeExp pop gdpPercap infant_mortality
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan~ Asia 1952 28.8 8.43e6 779. NA
## 2 Afghan~ Asia 1957 30.3 9.24e6 821. NA
## 3 Afghan~ Asia 1962 32.0 1.03e7 853. NA
## 4 Afghan~ Asia 1967 34.0 1.15e7 836. NA
## 5 Afghan~ Asia 1972 36.1 1.31e7 740. NA
## 6 Afghan~ Asia 1977 38.4 1.49e7 786. NA
## # ... with 1 more variable: fertility <dbl>
This is really useful because it means you can manipulate your data without having to store new data frames for each step. It also means you never compromise the original data.
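As a minimal sketch of the two points above, the example below uses a small made-up data frame (toy, a hypothetical stand-in for gapminder) to show that the data-first call and the piped call give the same result, and that neither one alters the original data.

```r
library(dplyr)

# Hypothetical toy data standing in for gapminder
toy <- data.frame(
  country = c("A", "A", "B"),
  year    = c(1952, 1957, 1952),
  lifeExp = c(30.1, 32.4, 65.0),
  stringsAsFactors = FALSE
)

# Data-first call and piped call are two spellings of the same operation
direct <- select(toy, country, lifeExp)
piped  <- toy %>% select(country, lifeExp)
identical(direct, piped)

# The original data frame still has all of its columns
names(toy)
```

Because nothing is assigned back to toy, it keeps all three of its columns after both calls.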
You can also assign your output to a new data frame.
lifeExp_by_country <- gapminder %>%
select(lifeExp, country)
head(lifeExp_by_country)
## # A tibble: 6 x 2
## lifeExp country
## <dbl> <fct>
## 1 28.8 Afghanistan
## 2 30.3 Afghanistan
## 3 32.0 Afghanistan
## 4 34.0 Afghanistan
## 5 36.1 Afghanistan
## 6 38.4 Afghanistan
gapminder %>%
select(-c(lifeExp, country))
country, continent and gdpPercap from the data frame.
gapminder %>%
Extra Credit
gapminder %>%
filter: subsetting rows
For filtering it is useful to know your set of operators:
| Logical Operator | Description |
|---|---|
| < | Less Than |
| <= | Less Than or Equal To |
| > | Greater Than |
| >= | Greater Than or Equal To |
| == | Equal To |
| != | Not Equal To |
| \| | Or |
| & | And |
| %in% c(…) | Membership in a list of elements |
(Ignore backslashes in the notebook.)
We can use filter to pick out a particular country. N.B., if we are unsure of names we can always use unique(gapminder$country) to check spellings.
gapminder %>%
filter(country == "Yemen, Rep.")
## # A tibble: 12 x 8
## country continent year lifeExp pop gdpPercap infant_mortality
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Yemen,~ Asia 1952 32.5 4.96e6 782. NA
## 2 Yemen,~ Asia 1957 34.0 5.50e6 805. NA
## 3 Yemen,~ Asia 1962 35.2 6.12e6 826. NA
## 4 Yemen,~ Asia 1967 37.0 6.74e6 862. NA
## 5 Yemen,~ Asia 1972 39.8 7.41e6 1265. NA
## 6 Yemen,~ Asia 1977 44.2 8.40e6 1830. NA
## 7 Yemen,~ Asia 1982 49.1 9.66e6 1978. NA
## 8 Yemen,~ Asia 1987 52.9 1.12e7 1972. NA
## 9 Yemen,~ Asia 1992 55.6 1.34e7 1879. NA
## 10 Yemen,~ Asia 1997 58.0 1.58e7 2117. NA
## 11 Yemen,~ Asia 2002 60.3 1.87e7 2235. NA
## 12 Yemen,~ Asia 2007 62.7 2.22e7 2281. NA
## # ... with 1 more variable: fertility <dbl>
We can also use filter with %in% to keep the rows for a set of countries of interest.
gapminder %>%
filter(country %in% c("Morocco", "Algeria", "Libya", "Tunisia", "Egypt", "Sudan", "Jordan", "Oman", "Lebanon", "Israel", "Syria", "Yemen, Rep."))
## # A tibble: 144 x 8
## country continent year lifeExp pop gdpPercap infant_mortality
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Algeria Africa 1952 43.1 9.28e6 2449. NA
## 2 Algeria Africa 1957 45.7 1.03e7 3014. NA
## 3 Algeria Africa 1962 48.3 1.10e7 2551. 148.
## 4 Algeria Africa 1967 51.4 1.28e7 3247. 149.
## 5 Algeria Africa 1972 54.5 1.48e7 4183. 141.
## 6 Algeria Africa 1977 58.0 1.72e7 4910. 119
## 7 Algeria Africa 1982 61.4 2.00e7 5745. 84.6
## 8 Algeria Africa 1987 65.8 2.33e7 5681. 46.3
## 9 Algeria Africa 1992 67.7 2.63e7 5023. 38.1
## 10 Algeria Africa 1997 69.2 2.91e7 4797. 35.1
## # ... with 134 more rows, and 1 more variable: fertility <dbl>
You can add multiple filters with a comma.
gapminder %>%
filter(country == "Yemen, Rep.", year >= 1960 & year <= 1985)
## # A tibble: 5 x 8
## country continent year lifeExp pop gdpPercap infant_mortality
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Yemen,~ Asia 1962 35.2 6.12e6 826. NA
## 2 Yemen,~ Asia 1967 37.0 6.74e6 862. NA
## 3 Yemen,~ Asia 1972 39.8 7.41e6 1265. NA
## 4 Yemen,~ Asia 1977 44.2 8.40e6 1830. NA
## 5 Yemen,~ Asia 1982 49.1 9.66e6 1978. NA
## # ... with 1 more variable: fertility <dbl>
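Conditions separated by commas inside filter() are combined with a logical AND. The sketch below uses a small made-up data frame (toy is hypothetical, not the workshop data) to show three equivalent ways of writing the same filter. between() is a dplyr helper for inclusive ranges.

```r
library(dplyr)

# Hypothetical toy data standing in for gapminder
toy <- data.frame(
  country = c("A", "A", "A", "B"),
  year    = c(1957, 1962, 1987, 1962),
  lifeExp = c(34.0, 35.2, 52.9, 70.1),
  stringsAsFactors = FALSE
)

# Commas between conditions behave like &
with_comma   <- toy %>% filter(country == "A", year >= 1960, year <= 1985)
with_and     <- toy %>% filter(country == "A" & year >= 1960 & year <= 1985)

# between() is shorthand for year >= 1960 & year <= 1985
with_between <- toy %>% filter(country == "A", between(year, 1960, 1985))

with_comma
```

Which spelling to use is mostly a matter of taste. Commas tend to read well when each condition is on its own line.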
gapminder %>%
filter(continent == "Europe", lifeExp > 70)
gapminder %>%
gapminder %>%
Extra Credit
%in% to get the countries “Chile”, “Argentina”, “Uruguay”, and “Peru” and only years greater than or equal to 1992.
gapminder %>%
!= to include the data from all continents apart from Europe.
gapminder %>%
summarise() uses existing R functions to calculate summary statistics.
For instance, we may wish to calculate the mean lifeExp across all rows of the data (every country in every year):
(lifeExp_stats <- gapminder %>%
summarise(mean_lifeExp = mean(lifeExp)))
## # A tibble: 1 x 1
## mean_lifeExp
## <dbl>
## 1 59.5
We can also calculate multiple summary statistics at the same time, separating each new summary variable with a comma. This way we can calculate the mean, min, and max lifeExp for all countries combined:
(lifeExp_stats <- gapminder %>%
summarise(
mean_lifeExp = mean(lifeExp), # mean
min_lifeExp = min(lifeExp), # min
max_lifeExp = max(lifeExp)) # max
)
## # A tibble: 1 x 3
## mean_lifeExp min_lifeExp max_lifeExp
## <dbl> <dbl> <dbl>
## 1 59.5 23.6 82.6
group_by() groups the rows of a data frame by one or more variables. It is especially useful before summarising.
(lifeExp_stats_country <- gapminder %>%
group_by(country) %>%
summarise(
mean_lifeExp = mean(lifeExp),
min_lifeExp = min(lifeExp),
max_lifeExp = max(lifeExp)
))
## # A tibble: 142 x 4
## country mean_lifeExp min_lifeExp max_lifeExp
## <fct> <dbl> <dbl> <dbl>
## 1 Afghanistan 37.5 28.8 43.8
## 2 Albania 68.4 55.2 76.4
## 3 Algeria 59.0 43.1 72.3
## 4 Angola 37.9 30.0 42.7
## 5 Argentina 69.1 62.5 75.3
## 6 Australia 74.7 69.1 81.2
## 7 Austria 73.1 66.8 79.8
## 8 Bahrain 65.6 50.9 75.6
## 9 Bangladesh 49.8 37.5 64.1
## 10 Belgium 73.6 68 79.4
## # ... with 132 more rows
gapminder %>%
group_by(continent, year) %>%
summarise(mean_gdpPercap = mean(gdpPercap))
gapminder %>%
Bonus
gapminder %>%
We’ve seen an example of the pipe operator %>% in the group_by() example above. The pipe operator allows you to combine multiple data wrangling steps, which are carried out in order.
I like to think of the pipe operator as the separator of different jobs on an assembly line.
You begin with your raw data (e.g. a tree), it then goes through the pipe to the next station where it is modified in some way (e.g. cut into planks), it can then pass to another station where it can be further modified, and so on, until voilà! You have your final product (e.g. a bird house).
Let’s say we are interested in calculating the mean life expectancy in Yemen before 1980. We can run the following:
yemen_pre1980_mean_lifeExp <- gapminder %>%
filter(country == "Yemen, Rep.", year < 1980) %>% # Return data for Yemen pre 1980
select(lifeExp) %>% # Select the column lifeExp (life expectancy)
summarise(meanlifeExp = mean(lifeExp)) # Calculate mean life expectancy
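To see what the pipe is doing in a chain like the one above, it can help to compare it with the same steps written as nested function calls. This is a sketch on a small made-up data frame (toy and its values are hypothetical), not the workshop data.

```r
library(dplyr)

# Hypothetical toy data standing in for gapminder
toy <- data.frame(
  country = c("Yemen, Rep.", "Yemen, Rep.", "Yemen, Rep.", "Oman"),
  year    = c(1972, 1977, 1982, 1972),
  lifeExp = c(40, 44, 49, 52),
  stringsAsFactors = FALSE
)

# Piped version: read top to bottom, one step per line
piped <- toy %>%
  filter(country == "Yemen, Rep.", year < 1980) %>%
  select(lifeExp) %>%
  summarise(meanlifeExp = mean(lifeExp))

# Nested version: the same steps, read from the inside out
nested <- summarise(
  select(
    filter(toy, country == "Yemen, Rep.", year < 1980),
    lifeExp
  ),
  meanlifeExp = mean(lifeExp)
)

piped
```

Both versions give the same one-row result. The piped form simply lists the steps in the order they happen.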
We can also combine multiple verbs and look at a slice of the data.
slice() chooses rows by their position within each group. In this case we use which.min() to pick out, for each year, the row with the lowest life expectancy.
gapminder %>%
group_by(year) %>%
slice(which.min(lifeExp))
## # A tibble: 12 x 8
## # Groups: year [12]
## country continent year lifeExp pop gdpPercap infant_mortality
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan~ Asia 1952 28.8 8.43e6 779. NA
## 2 Afghan~ Asia 1957 30.3 9.24e6 821. NA
## 3 Afghan~ Asia 1962 32.0 1.03e7 853. NA
## 4 Afghan~ Asia 1967 34.0 1.15e7 836. NA
## 5 Sierra~ Africa 1972 35.4 2.88e6 1354. 185.
## 6 Cambod~ Asia 1977 31.2 6.98e6 525. 155.
## 7 Sierra~ Africa 1982 38.4 3.46e6 1465. 164.
## 8 Angola Africa 1987 39.9 7.87e6 2430. 134.
## 9 Rwanda Africa 1992 23.6 7.29e6 737. 101.
## 10 Rwanda Africa 1997 36.1 7.21e6 590. 122.
## 11 Zambia Africa 2002 39.2 1.06e7 1072. 86.5
## 12 Swazil~ Africa 2007 39.6 1.13e6 4513. 74.7
## # ... with 1 more variable: fertility <dbl>
We can also see which country had the highest life expectancy in each year.
gapminder %>%
group_by(year) %>%
slice(which.max(lifeExp))
## # A tibble: 12 x 8
## # Groups: year [12]
## country continent year lifeExp pop gdpPercap infant_mortality
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Norway Europe 1952 72.7 3.33e6 10095. NA
## 2 Iceland Europe 1957 73.5 1.65e5 9244. NA
## 3 Iceland Europe 1962 73.7 1.82e5 10350. 16.9
## 4 Sweden Europe 1967 74.2 7.87e6 15258. 12.6
## 5 Sweden Europe 1972 74.7 8.12e6 17832. 10.4
## 6 Iceland Europe 1977 76.1 2.22e5 19655. 9.2
## 7 Japan Asia 1982 77.1 1.18e8 19384. 6.5
## 8 Japan Asia 1987 78.7 1.22e8 22376. 5
## 9 Japan Asia 1992 79.4 1.24e8 26825. 4.4
## 10 Japan Asia 1997 80.7 1.26e8 28817. 3.8
## 11 Japan Asia 2002 82 1.27e8 28605. 3
## 12 Japan Asia 2007 82.6 1.27e8 31656. 2.6
## # ... with 1 more variable: fertility <dbl>
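The group_by() plus slice() pattern can be checked by hand on a small made-up data frame (toy is hypothetical). Note that which.min() and which.max() return only the first position when there are ties, so this sketch keeps at most one row per group.

```r
library(dplyr)

# Hypothetical toy data: two countries observed in two years
toy <- data.frame(
  country = c("A", "B", "A", "B"),
  year    = c(1952, 1952, 1957, 1957),
  lifeExp = c(30, 50, 40, 35),
  stringsAsFactors = FALSE
)

# For each year, keep the row with the lowest life expectancy
lowest <- toy %>%
  group_by(year) %>%
  slice(which.min(lifeExp)) %>%
  ungroup()

# For each year, keep the row with the highest life expectancy
highest <- toy %>%
  group_by(year) %>%
  slice(which.max(lifeExp)) %>%
  ungroup()

lowest
highest
```

Calling ungroup() at the end is optional here. It just returns a plain tibble so that later steps are not grouped by year by accident.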
mutate() adds new columns that are functions of existing variables.
Using the verb mutate() we can create a new data column called gdp. In this case the per capita GDP gdpPercap needs to be multiplied by the population pop to get the overall GDP.
(gapminder <- gapminder %>%
mutate(gdp = gdpPercap * pop))
## # A tibble: 1,704 x 9
## country continent year lifeExp pop gdpPercap infant_mortality
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan~ Asia 1952 28.8 8.43e6 779. NA
## 2 Afghan~ Asia 1957 30.3 9.24e6 821. NA
## 3 Afghan~ Asia 1962 32.0 1.03e7 853. NA
## 4 Afghan~ Asia 1967 34.0 1.15e7 836. NA
## 5 Afghan~ Asia 1972 36.1 1.31e7 740. NA
## 6 Afghan~ Asia 1977 38.4 1.49e7 786. NA
## 7 Afghan~ Asia 1982 39.9 1.29e7 978. NA
## 8 Afghan~ Asia 1987 40.8 1.39e7 852. NA
## 9 Afghan~ Asia 1992 41.7 1.63e7 649. NA
## 10 Afghan~ Asia 1997 41.8 2.22e7 635. NA
## # ... with 1,694 more rows, and 2 more variables: fertility <dbl>,
## # gdp <dbl>
This is useful if we want to look at the overall gdp, but it is also a huge number which is difficult to compare among countries in a meaningful way.
It is often the case that our data is spread out over several data frames that we are interested in combining. We can join these data frames together using a variety of join functions from the dplyr package.
Let’s walk through the different types of joins using a simple example.
Let’s say we have two data frames or “tables” we are interested in joining together: person_table, which contains the information about the employee (Person_ID, Name and Job_ID), and job_table, which contains information about the job (Job_ID and Job_Name). We can join the two tables on the matched ID column Job_ID.
Person Table
person_table <- data.frame(Person_ID = c("Person1", "Person2"), Name = c("Jane Doe", "John Smith"), Job_ID = c("Job_1", "NA"))
person_table
## Person_ID Name Job_ID
## 1 Person1 Jane Doe Job_1
## 2 Person2 John Smith NA
Job Table
job_table <- data.frame(Job_ID = c("Job_1", "Job_2"), Job_Name = c("Programmer", "Statistician"))
job_table
## Job_ID Job_Name
## 1 Job_1 Programmer
## 2 Job_2 Statistician
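With an inner join, rows where there’s a match on the join criteria are returned. Unmatched rows are excluded. Don’t worry about the warning message. It is just pointing out that the column Job_ID in the person table has different factor levels from the Job_ID column in the job table, so both are coerced to character vectors before joining.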
inner_join(x = person_table, y = job_table, by = "Job_ID")
## Warning: Column `Job_ID` joining factors with different levels, coercing to
## character vector
## Person_ID Name Job_ID Job_Name
## 1 Person1 Jane Doe Job_1 Programmer
With a left join, you get all rows from the left side of the join even if there are no matching rows on the right side. You only get rows from the right side where there’s a join match to a row on the left.
left_join(x = person_table, y = job_table, by = "Job_ID")
## Warning: Column `Job_ID` joining factors with different levels, coercing to
## character vector
## Person_ID Name Job_ID Job_Name
## 1 Person1 Jane Doe Job_1 Programmer
## 2 Person2 John Smith NA <NA>
With a right join, you get rows from the left side of the join only where there’s a match on the right. You get all rows from the right side of the join even if there are no matching rows on the left.
right_join(x = person_table, y = job_table, by = "Job_ID")
## Warning: Column `Job_ID` joining factors with different levels, coercing to
## character vector
## Person_ID Name Job_ID Job_Name
## 1 Person1 Jane Doe Job_1 Programmer
## 2 <NA> <NA> Job_2 Statistician
With a full join, you get all rows from the left and right hand sides, joined where the criteria match.
full_join(x = person_table, y = job_table, by = "Job_ID")
## Warning: Column `Job_ID` joining factors with different levels, coercing to
## character vector
## Person_ID Name Job_ID Job_Name
## 1 Person1 Jane Doe Job_1 Programmer
## 2 Person2 John Smith NA <NA>
## 3 <NA> <NA> Job_2 Statistician
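The factor warning shown in the outputs above can usually be avoided by storing the key column as character in both tables. The sketch below rebuilds the two tables under that assumption. The names person_chr and job_chr are made up for this example, and the missing Job_ID is stored as a real NA rather than the string "NA". It then counts the rows each join returns.

```r
library(dplyr)

# Same content as person_table, but with character columns and a real NA
person_chr <- data.frame(
  Person_ID = c("Person1", "Person2"),
  Name      = c("Jane Doe", "John Smith"),
  Job_ID    = c("Job_1", NA),
  stringsAsFactors = FALSE
)

# Same content as job_table, with character columns
job_chr <- data.frame(
  Job_ID   = c("Job_1", "Job_2"),
  Job_Name = c("Programmer", "Statistician"),
  stringsAsFactors = FALSE
)

# Number of rows returned by each type of join
join_rows <- c(
  inner = nrow(inner_join(person_chr, job_chr, by = "Job_ID")),
  left  = nrow(left_join(person_chr, job_chr, by = "Job_ID")),
  right = nrow(right_join(person_chr, job_chr, by = "Job_ID")),
  full  = nrow(full_join(person_chr, job_chr, by = "Job_ID"))
)

join_rows
```

Comparing row counts like this is a quick sanity check after any join: an inner join can only shrink the data, while a full join can only grow it.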
Creating the new data frame uk_gdpPercap_df
To look at the per capita GDP in a way that’s more meaningful, let’s create a new variable, gdpPercap_rel, that is the gdpPercap of the country relative to the United Kingdom’s gdpPercap in that same year.
We can do this by dividing gdpPercap by the United Kingdom’s gdpPercap, making sure that we always divide two numbers that are from the same year. To do this we need to first:
Create a new data frame, uk_gdpPercap_df.
Filter for country == "United Kingdom".
Select gdpPercap and year.
Rename gdpPercap to uk_gdpPercap.
uk_gdpPercap_df <- gapminder %>%
filter(country == "United Kingdom") %>%
select(gdpPercap, year) %>%
rename(uk_gdpPercap = gdpPercap)
head(uk_gdpPercap_df)
## # A tibble: 6 x 2
## uk_gdpPercap year
## <dbl> <dbl>
## 1 9980. 1952
## 2 11283. 1957
## 3 12477. 1962
## 4 14143. 1967
## 5 15895. 1972
## 6 17429. 1977
We want to divide all the other gdpPercap by the UK gdpPercap in that same year.
One way we can do this is to match the two data frames using a left_join on the common variable, year. This will effectively make a new column, for the uk_gdpPercap that is joined up to our gapminder data frame.
A left_join keeps all of the rows from the first data frame (x = gapminder) and adds the matching rows from the other data frame (y = uk_gdpPercap_df), using the values in the column year to do the matching (by = "year").
gapminder <- left_join(gapminder, uk_gdpPercap_df, by = "year")
head(gapminder[, c("country", "year", "uk_gdpPercap", "gdpPercap")])
## # A tibble: 6 x 4
## country year uk_gdpPercap gdpPercap
## <fct> <dbl> <dbl> <dbl>
## 1 Afghanistan 1952 9980. 779.
## 2 Afghanistan 1957 11283. 821.
## 3 Afghanistan 1962 12477. 853.
## 4 Afghanistan 1967 14143. 836.
## 5 Afghanistan 1972 15895. 740.
## 6 Afghanistan 1977 17429. 786.
Now that we have the gdpPercap and uk_gdpPercap matched up, we can calculate the relative GDP per capita gdpPercap_rel.
gapminder <- gapminder %>%
mutate(gdpPercap_rel = gdpPercap/uk_gdpPercap)
We can double-check that our calculation worked by filtering for the United Kingdom to check that the relative GDP per capita is 1.
gapminder %>%
filter(country == "United Kingdom") %>%
select(gdpPercap_rel) %>%
head()
## # A tibble: 6 x 1
## gdpPercap_rel
## <dbl>
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
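The whole filter, rename, join and mutate sequence can be traced by hand on a tiny made-up data set. In this sketch toy, ref and toy_rel are hypothetical names and the numbers are invented, so the arithmetic is easy to verify.

```r
library(dplyr)

# Hypothetical toy data: the UK plus one other country, two years each
toy <- data.frame(
  country   = c("United Kingdom", "United Kingdom", "A", "A"),
  year      = c(1952, 1957, 1952, 1957),
  gdpPercap = c(10000, 12000, 5000, 3000),
  stringsAsFactors = FALSE
)

# Reference table: one row per year holding the UK value
# (select() can rename a column while selecting it)
ref <- toy %>%
  filter(country == "United Kingdom") %>%
  select(year, uk_gdpPercap = gdpPercap)

# Attach the UK value for the matching year, then divide
toy_rel <- toy %>%
  left_join(ref, by = "year") %>%
  mutate(gdpPercap_rel = gdpPercap / uk_gdpPercap)

toy_rel
```

Because ref has exactly one row per year, the left join does not change the number of rows in toy. If the reference table had duplicate years, rows would be duplicated, so checking nrow() before and after the join is a useful habit.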
How many countries had a GDP per capita less than or equal to the UK’s each year? Note that this count includes the UK itself, because its relative value is exactly 1.
gapminder %>%
group_by(year) %>%
filter(gdpPercap_rel <= 1) %>%
summarise(count = n())
## # A tibble: 12 x 2
## year count
## <dbl> <int>
## 1 1952 135
## 2 1957 135
## 3 1962 132
## 4 1967 128
## 5 1972 125
## 6 1977 124
## 7 1982 124
## 8 1987 127
## 9 1992 124
## 10 1997 127
## 11 2002 127
## 12 2007 126
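If the UK itself should not be counted, a strict comparison (< 1) can be used instead. The sketch below, on a small made-up data frame (toy is hypothetical), shows that variant and also the dplyr shortcut count(), which combines group_by() and summarise(n()) in one step.

```r
library(dplyr)

# Hypothetical toy data: relative GDP per capita for three countries
toy <- data.frame(
  country       = c("United Kingdom", "A", "B", "United Kingdom", "A", "B"),
  year          = c(1952, 1952, 1952, 1957, 1957, 1957),
  gdpPercap_rel = c(1, 0.5, 1.2, 1, 0.25, 0.8),
  stringsAsFactors = FALSE
)

# Strictly below the UK, so the UK's own row (exactly 1) is excluded
via_summarise <- toy %>%
  group_by(year) %>%
  filter(gdpPercap_rel < 1) %>%
  summarise(count = n())

# count() does the grouping and counting in a single call
via_count <- toy %>%
  filter(gdpPercap_rel < 1) %>%
  count(year, name = "count")

via_summarise
via_count
```

One thing to keep in mind with this pattern is that a year with no matching rows disappears from the result rather than appearing with a count of 0.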
gapminder %>%
select(country, gdpPercap_rel) %>%
filter(country %in% c("Argentina", "Chile", "Peru", "Brazil")) %>%
group_by(country) %>%
summarise(
max_gdp = max(gdpPercap_rel),
min_gdp = min(gdpPercap_rel),
mean_gdp = mean(gdpPercap_rel)
)
gapminder %>%
group_by() %>%
filter(gdpPercap_rel > 1) %>%
summarise(count = n())
gapminder %>%
filter(<BLANK>) %>%
select(country) %>%
unique()
Using what we’ve learned so far, let’s go back to our original comparisons.
Which of the three pairs of countries do you think have a higher infant mortality rate in 2007? Which are the most similar?
gapminder %>%
filter(year == 2007, country %in% c("Sri Lanka", "Turkey")) %>%
select(country, infant_mortality)
gapminder %>%
filter(year == 2007, country %in% c("Poland", "Malaysia")) %>%
select(country, infant_mortality)
gapminder %>%
filter(year == 2007, country %in% c("Pakistan", "Vietnam")) %>%
select(country, infant_mortality)
Which of the two pairs of countries do you think have a higher life Expectancy in 2007? Which are the most similar?
gapminder %>%
filter(year == 2007, country %in% c("South Africa", "Yemen, Rep.")) %>%
select(country, lifeExp)
## # A tibble: 2 x 2
## country lifeExp
## <fct> <dbl>
## 1 South Africa 49.3
## 2 Yemen, Rep. 62.7
gapminder %>%
filter(year == 2007, country %in% c("Chile", "Hungary")) %>%
select(country, lifeExp)
For the two pairs of countries below, which country do you think had the highest gdpPercap in 2007?
gapminder %>%
filter(year == 2007, country %in% c("Switzerland", "Kuwait")) %>%
select(country, gdpPercap)
gapminder %>%
filter(year == 2007, country %in% c("Colombia", "Nepal")) %>%
select(country, gdpPercap)
Which results did you find the most surprising?
One of the most meaningful ways to interpret and make sense of data is through plotting! Plotting the data allows us to look for relationships between variables, generate hypotheses, and identify patterns. A great package to make attractive graphics is ggplot2.
Let’s start by making a scatter plot of life expectancy by year for a handful of countries in the Middle East.
First we can make a new data frame called gapminder_middle_east.
middle_east <- c("Israel", "Jordan", "Oman", "Yemen, Rep.")
gapminder_middle_east <- gapminder %>%
filter(country %in% middle_east)
Then we can make a scatter plot in ggplot2 using the function geom_point plotting year on the x axis and lifeExp on the y axis.
ggplot(data = gapminder_middle_east) +
geom_point(mapping = aes(x = year, y = lifeExp))
To make a plot with ggplot2 you begin a plot with the function ggplot():
ggplot()
The first argument of ggplot() is the dataset to use in the graph:
ggplot(data = gapminder_middle_east)
You complete your graph by adding one or more layers to ggplot(), for example geom_point().
The function geom_point() adds a layer of points to your plot. Each geom function in ggplot2 takes a mapping argument which defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(). In the case of geom_point the x and y arguments of aes() specify which variables to map to the x and y axes.
geom_point(mapping = aes(x = year, y = lifeExp))
When these are specified, ggplot2 looks for the mapped variables (year and lifeExp) in the data argument.
Graphs in ggplot take the following form
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
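The template above puts the mapping inside the geom function. A mapping can also be given once in ggplot(), in which case every layer inherits it. The sketch below uses a small made-up data frame (toy is hypothetical) and checks that both spellings place the points at the same positions.

```r
library(ggplot2)

# Hypothetical toy data standing in for gapminder_middle_east
toy <- data.frame(
  year    = c(1952, 1957, 1962),
  lifeExp = c(40, 45, 50)
)

# Mapping supplied to the layer, as in the template
p_layer <- ggplot(data = toy) +
  geom_point(mapping = aes(x = year, y = lifeExp))

# Mapping supplied to ggplot() and inherited by the layer
p_global <- ggplot(data = toy, mapping = aes(x = year, y = lifeExp)) +
  geom_point()

# layer_data() returns the data ggplot2 computed for a layer
layer_data(p_layer)[, c("x", "y")]
layer_data(p_global)[, c("x", "y")]
```

Putting the mapping in ggplot() saves repetition when several layers share the same axes. Putting it in the geom keeps each layer independent.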
Depending on the <GEOM_FUNCTION> used, the arguments may vary. For instance, if we are plotting a histogram to look at the range of life expectancy in the dataset, we only need to provide a variable for the x axis. It is also a good idea to provide a value for the argument bins; if you leave it out, geom_histogram() falls back to 30 bins and prints a message suggesting you pick a better value.
ggplot(data = gapminder) +
geom_histogram(mapping = aes(x = lifeExp), bins = 25)
Take a look at what different plots are available by typing geom_ and then tab.
You can add a third variable, like country, to a two dimensional scatterplot by mapping it to an aesthetic. Aesthetics are visual properties of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can display a point (like the one below) in different ways by changing the values of its aesthetic properties.
It seems like overall, life expectancy (lifeExp) has been improving in most countries with time, but some are improving faster than others. We can add additional information to the aes argument to explore the data further. For instance, we can colour the points by country.
ggplot(data = gapminder_middle_east) +
geom_point(mapping = aes(x = year, y = lifeExp, colour = country))
This makes the graph a little easier to read, but some of the colours blend together. We can add an additional argument to change the shape of the point as well.
ggplot(data = gapminder_middle_east) +
geom_point(mapping = aes(x = year, y = lifeExp, colour = country, shape = country))
We can also map the size of the points to gdpPercap.
ggplot(data = gapminder_middle_east) +
geom_point(mapping = aes(x = year, y = lifeExp, colour = country, size = gdpPercap))
In this case ggplot gives us two legends, one for the size of the points and one for the country colour. Most countries’ gdpPercap has been increasing over time, although some increases are slighter than others.
We could also make the plot with the points sized by relative GDP per capita, gdpPercap_rel.
ggplot(data = gapminder_middle_east) +
geom_point(mapping = aes(x = year, y = lifeExp, colour = country, size = gdpPercap_rel))
We can customise our graph further by adding titles and labels.
ggplot(data = gapminder_middle_east) +
geom_point(mapping = aes(x = year, y = lifeExp, colour = country, size = gdpPercap)) +
ggtitle("Life Expectancy by Year") +
labs(x = "Year", y = "Life Expectancy")
We can also change the limits of our x and y axes. It is often a good idea to start the y axis from 0, so that differences between values are not visually exaggerated.
ggplot(data = gapminder_middle_east) +
geom_point(mapping = aes(x = year, y = lifeExp, colour = country, size = gdpPercap)) +
ggtitle("Life Expectancy by Year") +
labs(x = "Year", y = "Life Expectancy") +
ylim(0, 100)
We can also change the labels of our legend.
ggplot(data = gapminder_middle_east) +
geom_point(mapping = aes(x = year, y = lifeExp, colour = country, size = gdpPercap)) +
ggtitle("Life Expectancy by Year") +
labs(x = "Year", y = "Life Expectancy", colour = "Country", size = "GDP Per Capita") +
ylim(0, 100)
facet_wrap()
The function facet_wrap() wraps a series of plot panels into two dimensions. We can use it in our plot to make a plot panel for each country. There are other options for facet_wrap; take a look at the help file by typing ?facet_wrap to see other examples, like wrapping the data by two variables.
p1 <- ggplot(data = gapminder_middle_east) +
geom_point(mapping = aes(x = year, y = lifeExp, colour = country, size = gdpPercap)) +
ggtitle("Life Expectancy by Year") +
labs(x = "Year", y = "Life Expectancy", size = "GDP Per Capita", colour = "Country") +
ylim(0, 100)
p1 + facet_wrap(~country, ncol = 2)
And to save the last plot we made, we can run the following lines of code.
ggsave(filename = "pictures/Life_Expectancy_by_Year.png", width = 6, height = 4)
From this plot it seems that the countries with the largest gdpPercap tend to have higher life expectancy overall.
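A note on the ggsave() call above: by default it saves the last plot displayed, and it writes into a folder that is expected to exist already. The sketch below is a more explicit variant under a few assumptions. The toy data, the temporary folder and the file name toy_plot.pdf are all made up for illustration, and PDF is used only because it needs no extra graphics device.

```r
library(ggplot2)

# Hypothetical toy data and plot
toy <- data.frame(
  year    = c(1952, 1957, 1962),
  lifeExp = c(40, 45, 50)
)
p <- ggplot(data = toy) +
  geom_point(mapping = aes(x = year, y = lifeExp))

# Make sure the output folder exists before saving
out_dir <- file.path(tempdir(), "pictures")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Pass the plot object explicitly instead of relying on "the last plot"
out_file <- file.path(out_dir, "toy_plot.pdf")
ggsave(filename = out_file, plot = p, width = 6, height = 4)

file.exists(out_file)
```

Naming the plot with plot = p makes a script safer to rerun, because the saved file no longer depends on which plot happened to be drawn most recently.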
Time series plots are a great way to look at the evolution of a process through time. We can use a time series plot to ask the questions:
gapminder %>%
filter(country %in% c("Colombia", "Chile", "Argentina", "Brazil", "Peru", "Ecuador")) %>%
ggplot() +
geom_line(mapping = aes(x = year, y = gdpPercap, colour = country)) +
labs(x = "Year", y = "GDP Per Capita", colour = "Country") +
ylim(0, 15000)
Overall, GDP per capita has increased over time in all of the South American countries in the plot above. But how does this compare to how the UK’s GDP per capita changed?
gapminder %>%
filter(country %in% c("Colombia", "Chile", "Argentina", "Brazil", "Peru", "Ecuador")) %>%
ggplot() +
geom_line(mapping = aes(x = year, y = gdpPercap_rel, colour = country)) +
labs(x = "Year", y = "GDP Per Capita Relative to the UK", colour = "Country")
gapminder %>%
filter(country %in% c("Colombia", "Chile", "Argentina", "Brazil", "Peru", "Ecuador")) %>%
ggplot() +
geom_line(mapping = aes(x = year, y = infant_mortality, colour = country)) +
ylim(0, 150) +
labs(y = "Infant Mortality", x = "Year", colour = "Country")
## Warning: Removed 12 rows containing missing values (geom_path).
What kind of trends can you pick out through time? Which country’s fertility dropped the fastest? Which country’s fertility changed the least? When do we start to have data for fertility from these countries?
gapminder %>%
filter(country %in% c("Colombia", "Chile", "Argentina", "Brazil", "Peru", "Ecuador")) %>%
ggplot() +
geom_line(mapping = aes(x = year, y = fertility, colour = country)) +
ylim(0, 10) +
labs(y = "Fertility", x = "Year", colour = "Country") +
ggtitle("Fertility over Time")
## Warning: Removed 12 rows containing missing values (geom_path).
ggtitle().
gapminder %>%
filter(continent == "Americas", year %in% c(1952, 2007)) %>%
mutate(year = as.factor(year)) %>%
ggplot() +
geom_point(mapping = aes(y = country, x = lifeExp, colour = year)) +
labs(x = "Life Expectancy", y = "Country", colour = "Year")
Change the x = fct_reorder(country, life_exp_diff) to x = country. What does fct_reorder do? Take a look at ?fct_reorder for more info.
Rerun the plot, this time removing coord_flip(). What does the function coord_flip() change in the plot?
gap_lifeExpdiff_df <- gapminder %>%
group_by(country) %>%
summarise(life_exp_diff = max(lifeExp) - min(lifeExp)) %>%
top_n(n = 10)
## Selecting by life_exp_diff
p1 <- ggplot(gap_lifeExpdiff_df) +
geom_col(mapping = aes(x = fct_reorder(country, life_exp_diff), y = life_exp_diff), fill = "blue") +
labs(y = "Difference in Maximum and Minimum Life Expectancy (years)", x = "") +
ggtitle("Difference in Maximum and Minimum Life Expectancy", sub = "Top 10 countries with the largest difference (1952-2007)") +
ylim(0, 40)
p1 + coord_flip()
gapminder %>%
filter(country %in% c(<BLANK>)) %>%
select(year, pop, country) %>%
mutate(pop = pop/1000000) %>%
ggplot() +
geom_point(mapping = aes(x = <BLANK>, y = <BLANK>, colour = <BLANK>)) +
facet_wrap(country ~ .) +
ggtitle("Population in Argentina, Chile, Peru, and Uruguay") +
labs(x = "Year", y = "Population in Millions", colour = "Country")
Bonus
Hint: You’ll need to change top_n(), take a look at the help file using ?top_n and read what it says for the argument n.
sub = to reflect that we’re looking at the countries with the smallest difference.
gap_lifeExpdiff_df <- gapminder %>%
group_by(country) %>%
summarise(life_exp_diff = max(lifeExp) - min(lifeExp)) %>%
top_n(n = 10)
## Selecting by life_exp_diff
p1 <- ggplot(gap_lifeExpdiff_df) +
geom_col(mapping = aes(x = fct_reorder(country, life_exp_diff), y = life_exp_diff), fill = "blue") +
labs(y = "Difference in Maximum and Minimum Life Expectancy (years)", x = "") +
ggtitle("Difference in Maximum and Minimum Life Expectancy", sub = "Top 10 countries with the largest difference (1952-2007)") +
ylim(0, 40)
p1 + coord_flip()
gapminder %>%
ggplot() +
geom_line(mapping = aes(x = year,
y = <BLANK>,
group = country,
colour = <BLANK>)) +
labs(y = "Infant Mortality", x = "Year", colour = "Continent") +
facet_wrap(. ~ <BLANK>)
5a. Filter the data to find out which countries in Europe had an infant mortality rate greater than 60.
N.B. You do not need to make a plot.
gapminder %>%
filter(continent == "Europe", infant_mortality > <BLANK>)
gapminder %>%
ggplot() +
geom_line(mapping = aes(x = year,
y = lifeExp,
group = country,
colour = continent)) +
ylim(0, 100) +
labs(y = "Life Expectancy", x = "Year", colour = "Continent") +
facet_wrap(continent ~ .)
Which country is represented in the dip in Africa?
Which country is represented in the dip in Asia?
? or vignette respectively.
?filter
vignette("dplyr")
CRAN Task View Looking for a package to carry out a particular analysis? Check out CRAN Task View
Stack Overflow Check out Stack Overflow. This is one of the first ports of call, where members of the R community can help answer your questions.
Cheatsheets Many of the tidyverse packages come with their own cheatsheets, which are a quick reference on how to use various functions. It also gives a good overview of what functions are available.
Google. Google is your friend! Type “R help” followed by the warning or error message you received and I guarantee there will be someone who has had this problem before.
Meet ups and coding clubs Join a meet up or coffee and code group. Check out R-Ladies.
Further resources Looking to develop your learning further? Check out my trello board on R Resources for Data Science. This is still a work in progress, but I’m continually updating it with useful resources.
seq() as an example.
Stat 545 University of British Columbia Blog by Jenny Bryan
Thank you to Jhai Ghaghada for laying the foundation for the Intro to R course. Thanks to Andrew Meechan, Rebecca Brown, David Bell, and Lewis Dunne for being the guinea pigs for this work. Special thanks to Rebecca Brown for the comments and feedback on the content.